Introduction to Open Data Science - Course Project

Assignment 1

Eeva Vakkari

https://github.com/EevaVakkari/IODS-project

date()
## [1] "Mon Nov 27 15:47:13 2023"

Some feelings after the first week

It has been quite a heavy crash course into the world of R. The book is good, especially with the exercises and notes in the markdown file, but it’s massive for a beginner.

With the R syntax, I still feel almost as if I were illiterate, but I trust I'll get used to the R jargon with time and more exercises.

My aims for the course

After this course, I wish I could work somewhat independently, at least with very basic things, in R. I aim to familiarize myself with it at least to the level where I’ll be able to understand the error messages I get and the solutions I google.


Assignment 2

Eeva Vakkari

https://github.com/EevaVakkari/IODS-project

Introduction and overview

This week, I studied linear regression using a survey data set combined with exam results from a university statistics course. I made a simple linear model and validated it graphically. The first two code blocks contain the data wrangling part; after them, the analysis part starts.

Data wrangling

date()
## [1] "Mon Nov 27 15:47:13 2023"

I performed the data wrangling as instructed:

library(tidyverse) 
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
lrn14$attitude <- lrn14$Attitude / 10
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06",  "D15", "D23", "D31")
lrn14$deep <- rowMeans(lrn14[, deep_questions])
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
lrn14$surf <- rowMeans(lrn14[, surface_questions])
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
lrn14$stra <- rowMeans(lrn14[, strategic_questions])
str(lrn14)
## 'data.frame':    183 obs. of  64 variables:
##  $ Aa      : int  3 2 4 4 3 4 4 3 2 3 ...
##  $ Ab      : int  1 2 1 2 2 2 1 1 1 2 ...
##  $ Ac      : int  2 2 1 3 2 1 2 2 2 1 ...
##  $ Ad      : int  1 2 1 2 1 1 2 1 1 1 ...
##  $ Ae      : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ Af      : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ ST01    : int  4 4 3 3 4 4 5 4 4 4 ...
##  $ SU02    : int  2 2 1 3 2 3 2 2 1 2 ...
##  $ D03     : int  4 4 4 4 5 5 4 4 5 4 ...
##  $ ST04    : int  4 4 4 4 3 4 2 5 5 4 ...
##  $ SU05    : int  2 4 2 3 4 3 2 4 2 4 ...
##  $ D06     : int  4 2 3 4 4 5 3 3 4 4 ...
##  $ D07     : int  4 3 4 4 4 5 4 4 5 4 ...
##  $ SU08    : int  3 4 1 2 3 4 4 2 4 2 ...
##  $ ST09    : int  3 4 3 3 4 4 2 4 4 4 ...
##  $ SU10    : int  2 1 1 1 2 1 1 2 1 2 ...
##  $ D11     : int  3 4 4 3 4 5 5 3 4 4 ...
##  $ ST12    : int  3 1 4 3 2 3 2 4 4 4 ...
##  $ SU13    : int  3 3 2 2 3 1 1 2 1 2 ...
##  $ D14     : int  4 2 4 4 4 5 5 4 4 4 ...
##  $ D15     : int  3 3 2 3 3 4 2 2 3 4 ...
##  $ SU16    : int  2 4 3 2 3 2 3 3 4 4 ...
##  $ ST17    : int  3 4 3 3 4 3 4 3 4 4 ...
##  $ SU18    : int  2 2 1 1 1 2 1 2 1 2 ...
##  $ D19     : int  4 3 4 3 4 4 4 4 5 4 ...
##  $ ST20    : int  2 1 3 3 3 3 1 4 4 2 ...
##  $ SU21    : int  3 2 2 3 2 4 1 3 2 4 ...
##  $ D22     : int  3 2 4 3 3 5 4 2 4 4 ...
##  $ D23     : int  2 3 3 3 3 4 3 2 4 4 ...
##  $ SU24    : int  2 4 3 2 4 2 2 4 2 4 ...
##  $ ST25    : int  4 2 4 3 4 4 1 4 4 4 ...
##  $ SU26    : int  4 4 4 2 3 2 1 4 4 4 ...
##  $ D27     : int  4 2 3 3 3 5 4 4 5 4 ...
##  $ ST28    : int  4 2 5 3 5 4 1 4 5 2 ...
##  $ SU29    : int  3 3 2 3 3 2 1 2 1 2 ...
##  $ D30     : int  4 3 4 4 3 5 4 3 4 4 ...
##  $ D31     : int  4 4 3 4 4 5 4 4 5 4 ...
##  $ SU32    : int  3 5 5 3 4 3 4 4 3 4 ...
##  $ Ca      : int  2 4 3 3 2 3 4 2 3 2 ...
##  $ Cb      : int  4 4 5 4 4 5 5 4 5 4 ...
##  $ Cc      : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ Cd      : int  4 5 4 4 3 4 4 5 5 5 ...
##  $ Ce      : int  3 5 3 3 3 3 4 3 3 4 ...
##  $ Cf      : int  2 3 4 4 3 4 5 3 3 4 ...
##  $ Cg      : int  3 2 4 4 4 5 5 3 5 4 ...
##  $ Ch      : int  4 4 2 3 4 4 3 3 5 4 ...
##  $ Da      : int  3 4 1 2 3 3 2 2 4 1 ...
##  $ Db      : int  4 3 4 4 4 5 4 4 2 4 ...
##  $ Dc      : int  4 3 4 5 4 4 4 4 4 4 ...
##  $ Dd      : int  5 4 1 2 4 4 5 3 5 2 ...
##  $ De      : int  4 3 4 5 4 4 5 4 4 2 ...
##  $ Df      : int  2 2 1 1 2 3 1 1 4 1 ...
##  $ Dg      : int  4 3 3 5 5 4 4 4 5 1 ...
##  $ Dh      : int  3 3 1 4 5 3 4 1 4 1 ...
##  $ Di      : int  4 2 1 2 3 3 2 1 4 1 ...
##  $ Dj      : int  4 4 5 5 3 5 4 5 2 4 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
dim(lrn14)
## [1] 183  64

At this point, the data set consists of 63 numeric variables and one character variable, with 183 observations altogether. Now, let's focus on a subset of the data:

library(dplyr)
keep_columns <- c("gender","Age","Attitude", "deep", "stra", "surf", "Points")
analysis <- select(lrn14, one_of(keep_columns))
analysis <- filter_if(analysis, is.numeric, all_vars((.) != 0)) # remove rows containing zeros (i.e. observations with zero exam points)
setwd("~/Polyembryony_R/IODS/IODS-project/data")
write.csv(analysis, "learning2014.csv", row.names = F)
a <- read.csv("learning2014.csv") #Checking that csv is OK.
str(a)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(a)
##   gender Age Attitude     deep  stra     surf Points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21

Looking good. Continuing to the analysis part of the assignment.

Data analysis

Setting the working directory and reading the csv into a data frame:

setwd("~/Polyembryony_R/IODS/IODS-project/data")
learning2014 <- read.csv("learning2014.csv")
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   gender Age Attitude     deep  stra     surf Points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21

The data consists of learning approach survey results combined with student success in the exam. There are 7 variables and 166 observations.

pairs(learning2014[-1])

library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
p

Exam success, as represented by “Points”, seems to correlate with attitude. Other plots do not suggest strong correlations between the variables. Let’s study attitude, age and gender as explanatory variables for exam points.

my_model1 <- lm(Points ~ Attitude + Age + gender, data = learning2014)
summary(my_model1)
## 
## Call:
## lm(formula = Points ~ Attitude + Age + gender, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## Attitude     0.36066    0.05932   6.080 8.34e-09 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

When studied with a two-sided Student's t-test, it seems that only attitude has a statistically significant effect on exam points, whereas age and gender do not explain exam success. Attitude correlates positively with exam points: with a better attitude, exam points tend to be higher. Let's make a new model with attitude as the only explanatory variable.

my_model2 <- lm(Points ~ Attitude, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The second model also shows that attitude has a statistically significant, positive association with exam points. Removing age and gender from the model drops the adjusted R-squared value by only 0.0014, indicating a negligible loss in the explanatory power of the model. Let's draw diagnostic plots for the second model.

plot(my_model2, which = c(1, 2, 5))

Model evaluation

A linear regression model is based on four assumptions:
1. There is a linear relationship between the predictor(s) and the outcome.
2. The residuals are independent.
3. The residuals are normally distributed.
4. The residuals have equal variance.

As demonstrated above, attitude and exam points have a linear relationship. We can reasonably assume that the residuals are independent, since the survey represents individual students at a single time point, so autocorrelation is not a concern. The roughly flat line in the residuals vs. fitted values plot supports both linearity and equal variance. The Q-Q plot shows that the residuals are not perfectly normally distributed, but they are quite close to it. The residuals vs. leverage plot shows that no data point comes close to Cook's distance, i.e. none has high leverage, so there are no overly influential individual observations.
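As a numeric complement to the Q-Q plot's visual impression, a Shapiro-Wilk test on the residuals could be run; this is a minimal sketch rather than part of the assignment, demonstrated on the built-in cars data since the chunk should run on its own. For the model above, the session call would be shapiro.test(residuals(my_model2)).

```r
# Shapiro-Wilk test of residual normality; a small p-value suggests
# departure from normality. Demonstrated on the built-in 'cars' data set.
fit <- lm(dist ~ speed, data = cars)
shapiro.test(residuals(fit))
```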


Assignment 3

Eeva Vakkari

https://github.com/EevaVakkari/IODS-project

Introduction and overview

This week we practice logistic regression with a survey study combining alcohol consumption with students' grades, as well as demographic, social and school-related features. The data was downloaded from the UCI Machine Learning Repository: http://www.archive.ics.uci.edu/dataset/320/student+performance

date()
## [1] "Mon Nov 27 15:47:17 2023"
library(tidyverse)
library(dplyr)
library(readr)

Reading joined and modified data in and checking out the column names:

setwd("~/Polyembryony_R/IODS/IODS-project/data")
alc <- read_csv("~/Polyembryony_R/IODS/IODS-project/data/alc.csv")
## New names:
## Rows: 370 Columns: 36
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi... dbl
## (18): ...1, age, Medu, Fedu, traveltime, studytime, famrel, freetime, go... lgl
## (1): high_use
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(alc)
## # A tibble: 6 × 36
##    ...1 school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob    
##   <dbl> <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>   
## 1     1 GP     F        18 U       GT3     A           4     4 at_home  teacher 
## 2     2 GP     F        17 U       GT3     T           1     1 at_home  other   
## 3     3 GP     F        15 U       LE3     T           1     1 at_home  other   
## 4     4 GP     F        15 U       GT3     T           4     2 health   services
## 5     5 GP     F        16 U       GT3     T           3     3 other    other   
## 6     6 GP     M        16 U       LE3     T           4     3 services other   
## # ℹ 25 more variables: reason <chr>, guardian <chr>, traveltime <dbl>,
## #   studytime <dbl>, schoolsup <chr>, famsup <chr>, activities <chr>,
## #   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>,
## #   freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>,
## #   failures <dbl>, paid <chr>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>,
## #   alc_use <dbl>, high_use <lgl>

Analysis of students' high alcohol consumption

I study sex, age, absences and home address (urban/rural) as explanatory factors for a student's alcohol consumption. I hypothesize that younger female students with more absences, living in rural locations, consume the most alcohol. I assume all the variables have equal, additive effects on the response variable. Let's explore the data:

g1 <- ggplot(data = alc, aes(x = alc_use))
g1 + geom_bar()

It seems that the majority uses a relatively low amount of alcohol.

g2 <- ggplot(alc, aes(x = sex, y = alc_use, fill=sex))+
  geom_boxplot()
g2

Surprisingly, this might indicate that my hypothesis about females consuming more alcohol is not correct. I'll look into this in more detail.

g3 <- ggplot(alc, aes(x = age, y = alc_use))+
  geom_boxplot()
g3
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

Not so informative a plot type: with a continuous x, geom_boxplot() needs a group aesthetic (e.g. aes(group = age)) to draw one box per age.

g4 <- ggplot(alc, aes(x = address, y = alc_use, fill=address))+
  geom_boxplot()
g4

My hypothesis about rural living being associated with higher alcohol consumption might well be correct!

g5 <- ggplot(alc, aes(x = alc_use, y = absences, color=absences))+
  geom_point()
g5

Alcohol consumption seems quite scattered when plotted against absences, with no clear indication that my hypothesis is correct.

Logistic regression:

m <- glm(high_use ~ age + sex + address + absences, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + address + absences, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.71552    1.77325  -2.095   0.0361 *  
## age          0.13172    0.10325   1.276   0.2020    
## sexM         1.03363    0.24529   4.214 2.51e-05 ***
## addressU    -0.40391    0.28314  -1.427   0.1537    
## absences     0.09333    0.02356   3.962 7.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 411.27  on 365  degrees of freedom
## AIC: 421.27
## 
## Number of Fisher Scoring iterations: 4
coefficients(m)
## (Intercept)         age        sexM    addressU    absences 
##  -3.7155227   0.1317217   1.0336294  -0.4039128   0.0933257

So, it seems that age and address are statistically insignificant, whereas male sex and a high number of absences have highly significant associations with high alcohol use. Checking the odds ratios and confidence intervals:

OR <- coef(m) %>% exp
CI <- confint(m)
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.02434271 -7.24462192 -0.2732444
## age         1.14079079 -0.06972570  0.3361414
## sexM        2.81125045  0.55900936  1.5225186
## addressU    0.66770237 -0.95558085  0.1574767
## absences    1.09781924  0.04917901  0.1416402

An odds ratio > 1 (age, male sex, and absences) indicates a positive association with high alcohol consumption, while the odds ratio < 1 for an urban address indicates a negative association. Note that confint() returns intervals on the log-odds scale, so they are not directly comparable with the exponentiated odds ratios above; an interval crossing zero on the log-odds scale (equivalently, an odds-ratio interval crossing one) marks a statistically insignificant coefficient. By this criterion, age and address are insignificant, whereas male sex and absences are significant. Let's make a new model with only the statistically significant variables, sex and absences, included:
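To put the intervals on the same scale as the odds ratios, they can be exponentiated too. A small helper sketch (my own addition; it uses Wald intervals via confint.default() so it runs without extra packages, whereas confint() above gave profile-likelihood intervals). For the model above, the session call would be or_table(m):

```r
# Odds ratios with matching confidence intervals, both on the OR scale.
# confint.default() gives Wald intervals on the log-odds scale;
# after exponentiating, "interval crosses 1" marks insignificance.
or_table <- function(model) {
  cbind(OR = exp(coef(model)), exp(confint.default(model)))
}

# Small synthetic demonstration
set.seed(1)
x <- rnorm(80)
y <- rbinom(80, 1, plogis(0.8 * x))
or_table(glm(y ~ x, family = "binomial"))
```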

m2 <- glm(high_use ~ absences + sex, data = alc, family = "binomial")
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities)
select(alc, absences, sex, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 5
##    absences sex   high_use probability prediction
##       <dbl> <chr> <lgl>          <dbl>      <dbl>
##  1        3 M     FALSE          0.371      0.371
##  2        0 M     FALSE          0.306      0.306
##  3        7 M     TRUE           0.465      0.465
##  4        1 F     FALSE          0.147      0.147
##  5        6 F     FALSE          0.219      0.219
##  6        2 F     FALSE          0.160      0.160
##  7        2 F     FALSE          0.160      0.160
##  8        3 F     FALSE          0.173      0.173
##  9        4 M     TRUE           0.394      0.394
## 10        2 M     TRUE           0.349      0.349
table(high_use = alc$high_use, prediction = probabilities)
##         prediction
## high_use 0.13551762779056 0.147251609825878 0.159813770121884 0.173229912920424
##    FALSE               28                17                30                15
##    TRUE                 6                 6                 6                 4
##         prediction
## high_use 0.187520951046583 0.202701911186317 0.218780924782799
##    FALSE                11                10                 9
##    TRUE                  2                 3                 3
##         prediction
## high_use 0.235758237182898 0.253625273514685 0.272363804567288
##    FALSE                 7                 9                 4
##    TRUE                  0                 1                 2
##         prediction
## high_use 0.291945259033951 0.30607905167818 0.312330229220817 0.32699514282328
##    FALSE                 3               22                 1               20
##    TRUE                  0                7                 0                7
##         prediction
## high_use 0.333468215141401 0.348622467320154 0.355297646392281
##    FALSE                 3                10                 1
##    TRUE                  0                10                 0
##         prediction
## high_use 0.370891910681723 0.377746212171274 0.39372404069047 0.417029993594045
##    FALSE                16                 1               13                 6
##    TRUE                  3                 2                9                 3
##         prediction
## high_use 0.424162067815099 0.440712681060035 0.464668291130248
##    FALSE                 0                 7                 2
##    TRUE                  1                 2                 3
##         prediction
## high_use 0.471955510526924 0.488788039009662 0.49610297710717 0.512960107955271
##    FALSE                 1                 5                0                 1
##    TRUE                  0                 5                1                 4
##         prediction
## high_use 0.520268637326094 0.537071708240416 0.544339820805449 0.56101117435482
##    FALSE                 2                 2                 1                1
##    TRUE                  0                 2                 1                3
##         prediction
## high_use 0.584670018040823 0.607944857721631 0.630739153048458
##    FALSE                 0                 0                 0
##    TRUE                  4                 1                 4
##         prediction
## high_use 0.680934752085533 0.695404986885368 0.715493971837525
##    FALSE                 0                 0                 0
##    TRUE                  1                 1                 1
##         prediction
## high_use 0.721413863681047 0.844995611964618 0.916987057630397
##    FALSE                 0                 0                 0
##    TRUE                  1                 1                 1
##         prediction
## high_use 0.924057950235558
##    FALSE                 1
##    TRUE                  0
g6 <- ggplot(alc, aes(x = high_use, y = probabilities))+
  geom_boxplot()
g6

summary(m2)
## 
## Call:
## glm(formula = high_use ~ absences + sex, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.85303    0.22609  -8.196 2.49e-16 ***
## absences     0.09671    0.02336   4.140 3.48e-05 ***
## sexM         1.03451    0.24395   4.241 2.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 415.66  on 367  degrees of freedom
## AIC: 421.66
## 
## Number of Fisher Scoring iterations: 4

I suspect the cross tabulation above is not quite right: the prediction column was set to the raw probabilities rather than to a thresholded class, so the table spreads over every distinct fitted probability instead of being a 2x2 confusion matrix. The second model itself looks good statistics-wise. Trying to compute the average number of wrong predictions:
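Thresholding the probabilities first would give the usual 2x2 confusion matrix. A sketch with a hypothetical helper (in the session, the call would be confusion(alc$high_use, alc$probability)):

```r
# 2x2 confusion matrix: threshold the fitted probabilities at 0.5,
# then cross-tabulate predicted vs. observed classes
confusion <- function(observed, prob, threshold = 0.5) {
  table(high_use = observed, prediction = prob > threshold)
}

# Small demonstration with made-up values
confusion(observed = c(TRUE, TRUE, FALSE, FALSE),
          prob     = c(0.9, 0.2, 0.8, 0.1))
```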

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3

With prob = 0, the loss function gives the error of predicting that nobody is a high user, i.e. the majority-class baseline of 0.3 (the share of high users in the data). Random guessing would give an error of about 0.5, so a useful model should beat both; the model's own training error would be computed with the fitted probabilities rather than a constant.
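The baseline-vs-model comparison can be sketched like this; the definition of loss_func is repeated so the chunk runs on its own, and in the session the comparison would be loss_func(alc$high_use, alc$probability) versus loss_func(alc$high_use, 0):

```r
# Proportion of predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# Made-up example: baseline (predict FALSE for everyone) vs. fitted probabilities
cls <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
loss_func(cls, 0)                           # baseline: share of TRUEs, 0.4
loss_func(cls, c(0.9, 0.1, 0.8, 0.6, 0.2))  # one wrong prediction, 0.2
```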


Assignment 4

Eeva Vakkari

https://github.com/EevaVakkari/IODS-project

Introduction

This week, we use the Boston data set, which contains housing values in the suburbs of Boston together with socioeconomic, demographic and geographic variables. In this exercise, I study potential explanatory variables for the crime rate with clustering and linear discriminant analysis (LDA).

date()
## [1] "Mon Nov 27 15:47:25 2023"

Loading required packages:

library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(tidyverse)
library(dplyr)
library(readr)

Numerical and graphical overview of the data

Reading in the data. The data set is built into the MASS package, so it can be loaded into R simply like this:

data("Boston")

I have a look at the data:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The data set is now a data frame comprising 506 observations and 14 variables. All of the variables are numerical: either integer or numeric (decimal).

The variables
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centers.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

Distributions and relationships of the variables

Next, I create a graphical overview of the data. I draw a plot matrix with pairs() and then a correlation plot with corrplot().

pairs(Boston)

This looks a bit messy. I could probably fix it by adding arguments to pairs(), but I'd rather try a different approach that includes histograms, to better study the distributions. I also try to include Spearman correlation coefficients.

library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.3.2
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(Boston, histogram = TRUE, method = "spearman")
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties

I got plenty of warnings (the same pair repeated for every panel of the plot matrix), and the plot is still a bit messy-looking, but I got what I wanted: the histograms and the Spearman correlation coefficients. Now it is easier to inspect the distributions and the correlations. The distributions of crim, zn, age, dis, ptratio, black, and lstat are very skewed. indus, rad, and tax have two peaks. chas is encoded as 0 and 1. nox, rm, and medv are approximately normally distributed. The Spearman correlation coefficients indicate several statistically significant positive and negative correlations between the variables.
Next, I draw a correlation plot to visualize the correlations:

library(corrplot)  # needed for corrplot()

cor_matrix <- cor(Boston) 
corrplot(cor_matrix, method = "circle")

The majority of the variables are inter-correlated. Only chas seems to have very weak, if any, correlation to other variables.
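The strongest pairwise correlations can also be listed numerically instead of reading them off the circles. A minimal sketch, assuming the Boston data from the MASS package as in the exercise (the 0.7 cutoff is an arbitrary choice):

```r
# sketch: list variable pairs with |correlation| above 0.7
library(MASS)

cor_matrix <- cor(Boston)
cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA  # drop duplicate pairs
strong <- which(abs(cor_matrix) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, "row"]],
           var2 = colnames(cor_matrix)[strong[, "col"]],
           r    = round(cor_matrix[strong], 2))
```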

Standardization of the data

I perform the scaling as in the exercise, with the scale() function:

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix" "array"

The means of all variables are now zero. The scaled data is a matrix (class "matrix" "array"), so I convert it into a data frame:

boston_scaled <- as.data.frame(boston_scaled)
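With the default arguments, scale() centers each column to mean zero and divides by the column's standard deviation; this can be verified by hand on one column. A sketch, assuming the Boston data is loaded:

```r
# sketch: scale() is equivalent to (x - mean(x)) / sd(x), column by column
library(MASS)

manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, as.vector(scale(Boston)[, "crim"]))  # should be TRUE
```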

Making a categorical variable of the crime rate

First, I’m checking out the quantiles:

boston_scaled$crim <- as.numeric(boston_scaled$crim)
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110

Next, I’m using the quantiles in making of the categories for crime rate:

bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

It seems to work! I have dropped the old crim variable and added a new crime variable, which contains crime rate categories based on the quantiles.
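The default interval names produced by cut() are unwieldy; labels can be given directly. A sketch with label names following the exercise, assuming crim has not yet been dropped from boston_scaled:

```r
# sketch: cut() can label the quantile bins directly
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)  # roughly a quarter of the observations per class
```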

Train and test sets

Now, I'm making the train and test sets. I randomly pick 80 % of the rows for the train set. I also save the correct classes and remove the crime variable from the test data. Here, I follow the exercise 4 code closely.

boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",
                            sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
nrow(boston_scaled)
## [1] 506
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Now, the data is prepared for LDA.
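Because sample() is random, the split changes on every knit. A seed makes the split reproducible; a sketch, with an arbitrary seed value:

```r
# sketch: fixing the random seed makes the 80/20 split reproducible
set.seed(2023)  # arbitrary seed value
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```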

Linear discriminant analysis (LDA)

With LDA, it is possible to find linear combinations of the variables that separate the target variable classes, i.e. we should be able to distinguish the variables which predict certain crime rate classes in our data. This is all new to me, thus I follow exercise 4 in my code writing.

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2400990 0.2425743 0.2599010 
## 
## Group means:
##                   zn      indus         chas        nox         rm        age
## low       0.98130174 -0.8814373 -0.158758869 -0.8822704  0.4254563 -0.8961593
## med_low  -0.08786397 -0.2774759 -0.028797094 -0.5397107 -0.1020123 -0.3438514
## med_high -0.38048487  0.1787492  0.129415854  0.3631455  0.1163410  0.4177353
## high     -0.48724019  1.0170492 -0.009855719  1.0439596 -0.4222663  0.8256565
##                 dis        rad        tax     ptratio      black        lstat
## low       0.8505863 -0.6870244 -0.7384053 -0.45735515  0.3728526 -0.756470945
## med_low   0.3268949 -0.5485321 -0.4697184 -0.07374657  0.3123220 -0.125204659
## med_high -0.3587538 -0.4299039 -0.3327700 -0.27875657  0.1038902  0.007387289
## high     -0.8653078  1.6388211  1.5145512  0.78158339 -0.8422560  0.878845411
##                 medv
## low       0.52283195
## med_low   0.02389567
## med_high  0.17372924
## high     -0.65159710
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11470877  0.72682259 -0.92555860
## indus    0.10081048 -0.22310651  0.20141305
## chas    -0.08513300 -0.05059928  0.11441364
## nox      0.34547306 -0.76250231 -1.29792459
## rm      -0.09240621 -0.09539523 -0.09485993
## age      0.26692014 -0.41418563 -0.22036724
## dis     -0.02022357 -0.39466670  0.15366843
## rad      3.41169110  1.03084537 -0.23913495
## tax     -0.01024150 -0.11823572  0.74488787
## ptratio  0.12555508 -0.00213629 -0.18170133
## black   -0.10072485  0.02396385  0.14198129
## lstat    0.19571461 -0.19879133  0.43729793
## medv     0.18141796 -0.36455746 -0.15805353
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9560 0.0342 0.0098
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

It seems that rad has a very large impact on the crime rate, i.e. a suburb's accessibility to radial highways might be the key explanatory variable behind its medium high and high crime rates. This fits the proportion of trace above: LD1 alone accounts for about 96 % of the between-group variance.
Now, I check how good my model is by cross-tabulation:

lda.pred <- predict(lda.fit, newdata = test)
lda.pred
## $class
##   [1] med_low  med_low  med_low  med_low  med_high med_high med_high med_high
##   [9] med_high med_low  med_low  med_low  med_low  med_low  med_low  med_low 
##  [17] med_low  med_low  med_low  med_low  med_low  med_low  med_low  med_low 
##  [25] med_low  med_low  med_high med_high med_high med_high med_high med_high
##  [33] med_high med_high med_high med_high med_high med_low  med_low  med_low 
##  [41] low      low      low      low      med_low  med_high med_low  med_high
##  [49] med_low  med_low  low      med_low  med_low  low      med_high med_high
##  [57] med_high low      low      low      low      low      low      low     
##  [65] low      med_low  med_low  med_low  med_low  med_low  med_low  low     
##  [73] med_low  low      high     high     high     high     high     high    
##  [81] high     high     high     high     high     high     high     high    
##  [89] high     high     high     high     high     high     high     high    
##  [97] high     high     med_low  med_high med_low  med_low 
## Levels: low med_low med_high high
## 
## $posterior
##              low      med_low     med_high         high
## 2   2.703447e-01 6.400518e-01 8.960347e-02 3.905013e-22
## 5   3.080057e-01 5.357243e-01 1.562700e-01 2.258649e-21
## 6   3.659020e-01 5.414667e-01 9.263133e-02 2.012276e-21
## 12  1.006220e-01 5.529219e-01 3.464561e-01 8.014783e-17
## 24  3.966724e-02 4.445502e-01 5.157826e-01 4.196191e-16
## 27  6.267873e-02 4.366833e-01 5.006380e-01 1.073697e-16
## 28  5.822303e-02 4.361016e-01 5.056754e-01 1.892049e-16
## 32  7.625501e-02 3.994259e-01 5.243191e-01 9.076984e-17
## 35  4.214832e-02 3.872311e-01 5.706206e-01 9.477272e-16
## 46  3.725453e-01 6.138410e-01 1.361374e-02 2.736375e-22
## 48  9.146284e-02 8.219337e-01 8.660343e-02 5.904302e-20
## 49  2.232257e-02 8.829969e-01 9.468052e-02 1.835712e-18
## 52  4.589077e-01 4.962141e-01 4.487816e-02 7.185357e-20
## 59  4.159103e-01 4.732055e-01 1.108842e-01 3.155053e-14
## 62  1.159285e-01 4.604744e-01 4.235971e-01 1.366144e-11
## 63  2.690906e-01 4.257490e-01 3.051604e-01 4.343400e-13
## 69  2.599677e-01 7.310560e-01 8.976284e-03 1.189850e-20
## 71  3.265880e-01 6.665241e-01 6.887958e-03 4.521689e-22
## 73  3.556591e-01 6.385916e-01 5.749314e-03 4.073255e-22
## 80  1.301119e-01 8.491918e-01 2.069633e-02 4.819585e-19
## 86  4.029098e-01 5.488332e-01 4.825706e-02 1.252266e-21
## 88  4.269330e-01 5.437776e-01 2.928939e-02 1.705400e-21
## 96  4.733187e-01 4.969850e-01 2.969630e-02 4.968779e-23
## 105 8.187033e-02 5.114661e-01 4.066636e-01 1.644736e-15
## 110 5.821532e-02 5.244957e-01 4.172890e-01 2.845974e-15
## 120 6.719456e-02 6.540131e-01 2.787924e-01 5.788198e-15
## 130 9.394598e-03 2.126206e-01 7.779848e-01 6.782821e-15
## 134 7.092435e-03 1.460059e-01 8.469017e-01 4.330732e-15
## 139 7.236625e-03 2.074336e-01 7.853298e-01 1.158557e-14
## 142 1.664379e-03 2.203272e-01 7.780084e-01 5.313455e-13
## 150 1.292548e-04 2.454433e-03 9.974163e-01 5.896169e-13
## 153 5.643480e-04 5.422508e-03 9.940131e-01 9.255719e-15
## 154 9.608007e-05 1.184103e-03 9.987198e-01 5.831901e-13
## 156 2.489538e-04 2.571774e-03 9.971793e-01 2.134102e-14
## 162 2.623699e-03 3.227250e-02 9.651038e-01 1.008482e-15
## 164 2.115516e-03 4.477564e-02 9.531088e-01 2.935917e-17
## 172 1.528660e-02 2.528862e-01 7.318272e-01 3.369813e-15
## 173 1.097056e-01 5.962236e-01 2.940708e-01 4.529710e-16
## 184 3.027720e-01 3.565750e-01 3.406530e-01 1.661458e-19
## 185 2.241160e-01 5.425770e-01 2.333071e-01 7.911481e-19
## 192 7.946252e-01 1.949061e-01 1.046875e-02 8.638847e-20
## 196 9.854254e-01 1.118690e-02 3.387729e-03 8.339604e-21
## 200 9.926721e-01 7.108567e-03 2.192836e-04 3.576464e-23
## 201 9.935864e-01 6.246122e-03 1.674924e-04 1.855494e-23
## 216 2.494042e-01 6.614138e-01 8.918201e-02 2.242789e-19
## 220 4.401702e-02 4.387829e-01 5.172001e-01 2.627223e-17
## 223 4.987389e-02 4.912725e-01 4.588536e-01 6.274782e-14
## 225 2.116847e-02 1.193887e-01 8.594429e-01 6.254932e-13
## 230 2.809951e-01 5.356419e-01 1.833630e-01 1.385120e-14
## 237 4.892541e-02 5.274587e-01 4.236159e-01 4.265046e-14
## 239 7.170344e-01 2.736750e-01 9.290610e-03 7.907950e-19
## 241 4.672221e-01 4.857149e-01 4.706309e-02 3.023368e-17
## 250 3.951563e-01 5.598800e-01 4.496370e-02 8.980062e-17
## 252 5.466164e-01 4.306469e-01 2.273666e-02 2.608869e-17
## 259 3.906684e-02 2.661556e-02 9.343176e-01 1.308667e-15
## 267 6.170598e-02 6.459353e-02 8.737005e-01 1.800429e-15
## 269 3.569230e-01 1.425165e-01 5.005605e-01 1.350898e-17
## 270 4.933206e-01 4.772685e-01 2.941089e-02 2.465499e-21
## 275 8.852977e-01 1.079193e-01 6.782991e-03 8.482814e-22
## 278 8.628365e-01 1.291008e-01 8.062660e-03 6.913498e-22
## 280 7.249469e-01 2.508147e-01 2.423836e-02 1.060579e-19
## 285 9.939991e-01 5.892135e-03 1.087475e-04 2.173270e-26
## 288 8.565446e-01 1.377286e-01 5.726805e-03 3.319339e-18
## 301 9.369013e-01 6.038713e-02 2.711596e-03 6.143667e-20
## 306 7.072157e-01 1.761573e-01 1.166270e-01 2.922027e-14
## 311 3.169611e-01 6.071304e-01 7.590856e-02 1.828414e-18
## 322 2.466634e-01 5.881756e-01 1.651610e-01 7.428116e-18
## 327 3.170731e-01 6.015152e-01 8.141175e-02 7.042099e-19
## 335 3.119865e-01 5.038217e-01 1.841918e-01 2.526501e-18
## 338 2.039437e-01 5.296042e-01 2.664521e-01 3.855031e-17
## 339 3.753710e-01 4.918654e-01 1.327635e-01 5.080750e-18
## 342 8.287133e-01 1.608069e-01 1.047982e-02 2.101836e-25
## 343 2.703284e-01 6.971892e-01 3.248238e-02 3.720939e-25
## 351 8.450283e-01 1.499003e-01 5.071402e-03 3.538171e-25
## 360 5.435429e-21 4.758400e-18 4.754050e-14 1.000000e+00
## 367 8.962277e-22 1.031867e-18 4.383328e-15 1.000000e+00
## 378 2.552513e-20 4.400917e-17 4.470350e-14 1.000000e+00
## 379 9.984205e-21 2.259563e-17 1.943714e-14 1.000000e+00
## 380 1.806113e-20 3.185723e-17 2.456446e-14 1.000000e+00
## 385 6.068870e-23 2.616639e-19 2.497620e-16 1.000000e+00
## 397 2.238909e-20 3.160179e-17 4.287190e-14 1.000000e+00
## 401 4.247972e-21 1.118570e-17 9.159669e-15 1.000000e+00
## 404 1.353884e-20 1.908754e-17 1.602323e-14 1.000000e+00
## 405 2.342263e-21 6.122816e-18 4.439292e-15 1.000000e+00
## 425 1.014968e-19 9.762811e-17 1.645300e-14 1.000000e+00
## 442 1.561131e-21 2.714906e-18 1.558830e-14 1.000000e+00
## 447 2.090395e-21 2.690342e-18 1.536366e-14 1.000000e+00
## 450 3.439769e-21 5.352671e-18 1.701798e-14 1.000000e+00
## 452 7.791447e-21 1.149020e-17 3.969347e-14 1.000000e+00
## 457 3.186431e-22 4.233903e-19 2.130772e-15 1.000000e+00
## 459 1.480712e-20 1.660838e-17 4.465137e-14 1.000000e+00
## 461 7.530730e-21 9.116595e-18 3.671435e-14 1.000000e+00
## 467 1.206135e-21 1.930847e-18 4.326702e-15 1.000000e+00
## 470 9.931902e-18 1.144541e-14 8.018002e-13 1.000000e+00
## 473 2.193699e-18 3.356047e-15 6.122342e-13 1.000000e+00
## 477 1.560073e-19 3.142934e-16 1.168528e-13 1.000000e+00
## 486 7.109969e-17 6.281408e-14 5.694631e-12 1.000000e+00
## 487 1.591746e-18 2.883190e-15 4.892601e-13 1.000000e+00
## 490 2.992993e-03 6.001846e-01 3.968224e-01 1.134693e-14
## 493 5.270230e-03 4.629735e-01 5.317562e-01 9.084257e-16
## 495 8.850582e-02 4.760897e-01 4.354044e-01 1.031836e-14
## 502 3.109390e-01 3.749625e-01 3.140985e-01 1.274534e-21
## 
## $x
##            LD1         LD2         LD3
## 2   -3.4806263 -0.68054120  0.65387497
## 5   -3.2966890 -0.73063782  0.13317537
## 6   -3.3078835 -0.39960536  0.31196492
##  ...      (98 further rows omitted for brevity)      ...
## 502 -3.3631972 -1.12672895 -0.58945483
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      11        0    0
##   med_low    4      22        3    0
##   med_high   0       7       19    2
##   high       0       0        0   22

Interpretation and evaluation of the LDA
The model predicts the med_high and high crime-rate categories very well, whereas the other categories (med_low and especially low) are predicted less accurately. The model might be useful for flagging areas at risk of a high crime rate, but it shouldn’t be used the other way round, to detect low-crime-rate suburbs.
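This reading can be quantified from the same cross-tabulation; in the sketch below, the confusion matrix values are copied by hand from the table above and turned into overall and per-class accuracy:

```r
# confusion matrix copied from the cross-tabulation above (rows = correct, cols = predicted)
conf <- matrix(c(12, 11,  0,  0,
                  4, 22,  3,  0,
                  0,  7, 19,  2,
                  0,  0,  0, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# overall accuracy: share of observations on the diagonal
sum(diag(conf)) / sum(conf)            # 75/102, about 0.74
# per-class accuracy: diagonal divided by the row totals
round(diag(conf) / rowSums(conf), 2)   # high is predicted perfectly, low worst
```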

Clustering

First, I look at the Euclidean and Manhattan distances. So as not to mess up my previous work, I scale the data again and use boston_scaled2 in this part.

data("Boston")
boston_scaled2 <- scale(Boston)
# dist() computes the pairwise distance matrix; the default method is Euclidean
dist_eu <- dist(boston_scaled2)
dist_man <- dist(boston_scaled2, method = "manhattan")
summary(dist_eu)
summary(dist_man)
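As a quick sanity check on what the two metrics mean, the distances between two toy points (hypothetical values, not from the Boston data) can be computed by hand and compared with dist():

```r
a <- c(0, 0)
b <- c(3, 4)
# Euclidean: straight-line distance, sqrt(3^2 + 4^2) = 5
sqrt(sum((a - b)^2))
# Manhattan: sum of absolute coordinate differences, 3 + 4 = 7
sum(abs(a - b))
# dist() gives the same values
m <- rbind(a, b)
dist(m)                        # 5
dist(m, method = "manhattan")  # 7
```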

Next, I perform a k-means clustering with the seed set at 13 and 4 centers:

set.seed(13)
km <- kmeans(boston_scaled2, centers = 4)
# color the points by their cluster assignment
pairs(boston_scaled2, col = km$cluster)

Next, I verify whether this number of centers was a good choice and visualize the result, as in Exercise 4:

set.seed(123)
# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line()

Based on the elbow in the plot, it seems that five centers might be a good choice.

# k-means clustering
km <- kmeans(boston_scaled2, centers = 5)
# plot the data set with clusters
pairs(boston_scaled2, col = km$cluster)
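To put some numbers behind the plot, the cluster sizes and centers can also be inspected; the exact values depend on the random initialization, so none are shown here:

```r
# number of observations in each of the five clusters
km$size
# cluster centers (means of the scaled variables), e.g. for lstat and medv
round(km$centers[, c("lstat", "medv")], 2)
```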

Interpretation of clustering

Based on the plot above, several variables contribute to the clustering. For simplicity, I pick two continuous variables, lstat and medv, which seem to separate the clusters nicely, for closer inspection and plotting against the clusters:

plot_data <- as.data.frame(boston_scaled2)
plot_data$cluster <- factor(km$cluster)
ggplot(plot_data, aes(x = lstat, y = medv)) +
  geom_point(aes(color = cluster)) +
  labs(color = 'Cluster')

As a reminder, the variables stand for:
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

The clusters are not very clearly separated by these two variables, but some features of each cluster can be observed.
Cluster 1 is “the average neighborhood”, characterized by roughly mean values in both variables: the value of owner-occupied homes and the percentage of lower-status population. Cluster 2 differs from cluster 1 by having higher-value (or just more) owner-occupied homes and a lower percentage of lower-status population. Cluster 3 overlaps with the other clusters, but its emphasis is on a higher percentage of lower-status population and lower-value owner-occupied homes. Cluster 4 overlaps with cluster 3 but shows an even more pronounced pattern of a high share of lower-status population and low-value owner-occupied homes. Cluster 5 is best described as Kauniainen, i.e. it generally has high-value owner-occupied homes and a low percentage of lower-status residents, though a few are still present (most likely living close to the railway station). ;)

The variables seem to be correlated, which interferes with the results. For example, lstat and medv appear to be negatively correlated: the higher the value of owner-occupied homes in an area, the lower its percentage of lower-status population.
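The suspected negative relationship is easy to check directly on the raw data; the Pearson correlation comes out strongly negative (about -0.74):

```r
library(MASS)  # provides the Boston data
data("Boston")
# correlation between % lower-status population and median home value
cor(Boston$lstat, Boston$medv)  # about -0.74
```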

The cluster separation could be more distinct; for my taste, the clusters now overlap quite a lot. Probably I misinterpreted the elbow plot and the real number of clusters should have been less than five. Also, the data should be cleaned of outliers: I suspect that the points at the maximum value of medv are actually outliers.

Bonus

I give this a shot in the dark, copy-pasting from the previous code chunks:

# scale the data again and run k-means with 5 centers
boston_scaled3 <- scale(Boston)
kmeans_bonus <- kmeans(boston_scaled3, centers = 5)
# attach the cluster labels to the data and fit an LDA with the clusters as the target
Boston$cluster <- factor(kmeans_bonus$cluster)
lda_fit_bonus <- lda(cluster ~ ., data = Boston)

Now, testing visualization:

plot(lda_fit_bonus)

And then, checking out the coefficients:

lda_fit_bonus$scaling
##                   LD1           LD2           LD3          LD4
## crim     0.0080852882 -2.134639e-02 -1.652315e-02  0.005530265
## zn      -0.0031069655 -6.303245e-02  3.040976e-02 -0.051703569
## indus    0.0253157070  5.494003e-02  6.928201e-02 -0.046050129
## chas    -0.7677964641  4.193077e-01 -1.210915e+00 -0.858103945
## nox     -0.4146107663  7.748986e-01  1.300230e+00 -3.497642548
## rm      -0.0100449764 -2.071640e-01 -8.460646e-01 -0.466827093
## age      0.0018135788  1.770948e-02 -7.796642e-03 -0.025090244
## dis     -0.1329179998 -1.605232e-01  1.088635e-01  0.151761364
## rad      0.5010233181 -1.243454e-01 -9.880000e-02  0.170930126
## tax     -0.0009979140 -2.646755e-03  2.713772e-03 -0.005439178
## ptratio  0.0267862421  8.509904e-02  1.383573e-01 -0.126703143
## black   -0.0004856502 -3.039818e-05 -6.642119e-05  0.001016834
## lstat    0.0170738701  8.619402e-03 -7.189688e-03 -0.054987522
## medv    -0.0193304059 -2.865138e-02 -7.933669e-02 -0.046756580

Judging by the absolute sizes of the LD1 coefficients, chas, rad, and nox have the strongest impact in separating the clusters.
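The same reading can be done programmatically; in the sketch below, the LD1 coefficients are copied by hand (rounded) from the scaling table above and ranked by absolute size:

```r
# LD1 coefficients copied (rounded) from lda_fit_bonus$scaling above
ld1 <- c(crim = 0.0081, zn = -0.0031, indus = 0.0253, chas = -0.7678,
         nox = -0.4146, rm = -0.0100, age = 0.0018, dis = -0.1329,
         rad = 0.5010, tax = -0.0010, ptratio = 0.0268, black = -0.0005,
         lstat = 0.0171, medv = -0.0193)
# chas, rad, and nox come out on top
sort(abs(ld1), decreasing = TRUE)[1:4]
```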

Super-Bonus

# drop the cluster column so that only the original variables remain as predictors
model_predictors <- Boston[, -which(names(Boston) == "cluster")]
dim(model_predictors)
## [1] 506  14
dim(lda_fit_bonus$scaling)
## [1] 14  4

Here, I take advantage of the Bonus exercise:

# project the predictors onto the LDA discriminant directions
matrix_product <- as.matrix(model_predictors) %*% lda_fit_bonus$scaling
matrix_product <- as.data.frame(matrix_product)
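The dimension checks above matter because matrix multiplication only works when the inner dimensions agree: the number of columns of the left matrix must equal the number of rows of the right one. A toy illustration:

```r
A <- matrix(1:6, nrow = 2)    # 2 x 3
B <- matrix(1:12, nrow = 3)   # 3 x 4
dim(A %*% B)                  # the product is 2 x 4
```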

Getting plotly:

#install.packages("plotly")
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Drawing a 3D plot, colored by the crime categories:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = boston_scaled$crime) 

Here too, it is clear that the class “high” stands far apart from the other classes.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = factor(km$cluster)) 

It seems that the crime-rate categories and the k-means clusters capture much the same real-life socioeconomic structure!